I want to begin by expressing my deep appreciation to the Nominating and Executive Committees and the Association at large for this tremendous honor. I am especially proud to join my collaborators and friends who previously received the Lifetime Achievement Award: Bill Woods, Martin Kay, Lauri Karttunen, and Joan Bresnan. They have profoundly influenced the contributions for which I think I am being recognized today, as have many other colleagues and students that I have worked with closely over the years.

I will follow the tradition set down by most previous LTA recipients by describing how events in my personal history propelled me toward the concepts, theories, and algorithms I have helped to develop and have become known for. This historical perspective is quite different from a paper setting forth some current technical results.

Some former recipients, Martin Kay for example, were intrigued by language from a very early age. That was not the case for me. I started as an undergraduate at the University of California, Berkeley, with a major concentration in physics. I decided at one point that physical reality was a little too messy, and I also discovered that I was not interested in spending long hours in a laboratory. So after a year or two I switched to math. That carried me forward to the middle of my fourth and final year, when I began to contemplate a future as a mathematician. That also seemed a little dry and unappealing. I thought I should look around, even at that late date, and maybe find a specialization that was a little more social and, for me, a little more engaging.

I mentioned this to a friend during a brief ride in an elevator. He told me about a professor in the Psychology Department who was offering an individual major in what he called “language behavior.” The major wasn’t actually inside a department because it combined material from a bunch of separate disciplines: psychology, linguistics, philosophy, anthropology, and maybe even mathematics. That sounded pretty interesting, and an important property at that late stage of my undergraduate career was that it had no official prerequisite courses.

The professor, Dan Slobin, was an expert in child language acquisition, and, as I later learned, child language acquisition played a prominent ideological role in the newly popular approach to linguistics, transformational grammar. I looked at some of the courses that other students had put into their individual-major schedules and at some of the core readings in psycholinguistics that they had included. The field was brand new; it didn’t have the long history of other disciplines, and therefore there was not a lot of material to be mastered before jumping in. And I realized that this was a neat application of formal mathematical methods to a collection of humanistic problems. All to the good—I signed up with Dan.

I also decided to take the few remaining math courses that I needed to complete that major. So at that point I guess I had become, if only by accident, a mathematical psycholinguist. I was awarded a double-major undergraduate degree, and I was admitted to graduate school in social psychology, which is where psycholinguistics was located at Harvard.

I encountered a few high-level ideas during that undergraduate period that have reverberated throughout the course of my career. Chomsky’s Aspects of the Theory of Syntax (1965) had appeared just a few years before.
It provided some formal details of the first main revision to the basic machinery of transformational grammar, but that wasn’t what struck me. Rather, there was a simple statement at the beginning that set forth a particular view of the relation between a speaker’s knowledge of language, the result of learning a language, and the ability to put that knowledge to use in comprehension and production. He said:

    No doubt, a reasonable model of language use will incorporate, as a basic component, the generative grammar that expresses the speaker-hearer’s knowledge of the language; but this generative grammar does not, in itself, prescribe the character or functioning of a perceptual model or a model of speech production. (Chomsky 1965, page 9)

This is a clear statement of what became known as the Competence Hypothesis. I understood it as making two claims: First, that a representation of linguistic knowledge should play a causal role in the mental processes of everyday communication by language. And second, that linguistic knowledge was only one of many components of a larger model of language performance.

The first claim, I thought, distinguished this application of formal descriptions from the way that mathematical characterizations are deployed in most other sciences. The sun, for example, behaves according to the partial differential equations for energy transport, but presumably it does not solve a representation of those equations in order to decide what to do next. But the proposition here is that the psychological system encodes and then interprets something that correlates with generalizations about the properties of language that emerge from linguistic investigation. That would be a very neat state of affairs, if it were true.

The second claim places this in the context of another powerful conception, the notion of a “nearly decomposable system” that Herb Simon put forward in a short essay entitled The Architecture of Complexity (Simon 1962). Simon argued that apparently complicated systems, if created artificially or through an evolutionary process, are likely to be recursively decomposable into collections of separate components. Such systems appear complicated because these components interact in non-negligible ways. But those interactions are relatively simple compared to the interactions of elements within the individual components. It is thus fruitful, Simon would say, to construct scientific explanations of such systems by treating each of the components independently at first, and then layering on top an account of their interactions. Chomsky’s statement was consistent with this kind of decomposable architecture, although he never got around to the layering-on-top part. That became the focus of some of my work.

You will have noticed that my story so far has not included a mention of computers or computation. I had been involved in computing during my physics period—I had a summer job computing the fluorescent signature of a high-altitude nuclear explosion, very topical in those days. And I had taken numerical methods courses in the math curriculum. But there was as yet no connection between computing and language behavior.

Because I had finished up at an odd time, in the spring of 1968, there was a six-month interlude between leaving Berkeley and starting up at Harvard. I went home to Los Angeles to look for a temporary job. In another chance encounter, a friend with connections to the Rand Corporation told me that Rand had a linguistics group that maybe I could hook up with.
I sent in my papers and went for an interview. I found that they did have an interest in mathematical psycholinguistics, of all things, and that I could come on as a six-month intern.

Dave Hays was the head of a group that included Martin Kay, and Lauri Karttunen was there just finishing his dissertation. You may know that Dave was one of the founders of our field. He coined the term “computational linguistics,” helped to create the American precursor organization of the ACL, and was involved in setting up the international COLING conferences. He was also a very smart man, in many ways ahead of his time. He was an early advocate for dependency grammar, for example, as a counter to transformational grammar and immediate-constituent approaches (Hays 1964). And in my case I remember that he strongly urged me to learn about regular languages and finite-state machines. Having read Syntactic Structures (Chomsky 1957), I knew of course that regular languages were totally uninteresting. And coming from Berkeley I thought that the acronym FSM stood for Free Speech Movement. But I did try to learn at least a little bit.

For my first project, Dave sent me down the hall to work with Lauri on some semantic issues. As I recall, our computational task was to implement a semantic network within a Rand time-sharing system. I was also involved in Lauri’s linguistic investigation into some properties of noun-phrase reference—the contexts in which definite and indefinite descriptions can be properly used. Semantics turned out not to be so hard. I went back to Dave after a few months and told him that we had finished that project. We now wanted to add a front end to make it easier to interact with our system.

He told me to go down the hall in the other direction and see what I could get from Martin Kay. Martin had developed a parsing program that was not based on transformational grammar but seemed expressive enough to recognize the patterns of natural language (Kay 1967). It was a system for rewriting sequences of trees into sequences of trees, with Turing machine power. But it used a very efficient data structure, called the chart, to hold the intermediate state of the computation. There was a missing ingredient, however: Martin had the program, but he didn’t yet have an English grammar to drive it. That became my project for the rest of the summer, and in some sense, for the rest of my life. That was how I learned about parsing; that was my entry into computational linguistics.

The summer came to an end and I reverted to my psycholinguistic studies at Harvard. My advisor was Roger Brown, probably the pre-eminent figure in child language acquisition at the time, and language acquisition was still a focus, particularly at Harvard. As a first experiment, I decided to see whether two-year-olds were able to make some of the definite/indefinite reference distinctions that Lauri and I had worked on. The experiment failed—the children ignored both what I said to them and the rubber ducks that I gave them to play with. On the basis of that experience I decided I didn’t have the patience or interest needed to work with children. I shifted to adult comprehension.

As per Chomsky’s suggestion in Aspects, psycholinguists had investigated a theory of comprehension that incorporated transformational grammar as its repository of linguistic knowledge.
They tested the simple assumption that, if that were the case, the complexity of sentence comprehension ought to be proportional to the number of grammatical transformations that applied in the linguistic derivation of a sentence. This idea, called the Derivational Theory of Complexity, received some initial empirical support. The measurement techniques of that era were crude at best, but it appeared in early studies that passive and negative sentences, involving two transformations, were more difficult for people to deal with than simple declaratives (Miller and McKean 1964). Later experiments, however, showed the opposite effect: application of an additional transformation had the effect of reducing psychological complexity. Slobin (1968), for example, observed this in the case of the transformation that deletes the agent by-phrase of a passive sentence. The sentence gets shorter and easier to process.

By the late 1960s psycholinguists had more or less given up on the Derivational Theory of Complexity. In fact, they had basically given up on Chomsky’s suggestion that a model of performance would incorporate a representation of the native speaker’s linguistic knowledge. They began to operate independently of linguistics, forming heuristic models of language behavior by generalizing just from experimental results and summarizing those results in a collection of what they called “perceptual strategies.” One strategy, that the first Noun-Verb-Noun sequence in an English sentence is taken as the Agent-Action-Object, worked for simple actives but not for passives, thus accounting for the fact that passives seem to be harder to process. The problem, of course, was that this did not explain how passives could be understood at all—there was no obvious back-off when a strategy failed to apply.

Although the Derivational Theory of Complexity did not work out, retreating to a collection of disconnected perceptual strategies seemed like an overly weak response. The problem, I thought, was not in the general conception laid out in the Competence Hypothesis, that systematic knowledge of language should be incorporated in performance models. Rather, the mistake was in choosing a grammatical framework, the transformational formalism, whose processing characteristics were so obviously unrealistic. Mainstream linguistics was locked onto that formalism, but I knew from my work with Martin’s parser that there might be other sufficiently expressive alternatives, and perhaps some that were psychologically more plausible.

My reputation as a grammar engineer somehow followed me from Rand to Cambridge. A project was starting up at Bolt Beranek and Newman to build a natural language interface that would enable lunar geologists to access geophysical data on the moon rocks that had been brought back from the Apollo missions. Bill Woods was heading the project, and one of the system’s key components was the Augmented Transition Network parser that he had developed (Woods 1970).

As is well known, an ATN grammar, as illustrated in Figure 1, consists of a collection of finite-state transition networks, each of which describes how a phrase of a given category (S, NP, etc.) can be realized, and a recursive control that allows for one phrase to be embedded in another. The transitions are augmented with actions that can store information in local variables, called registers, and registers are passed forward to condition what may happen on later transitions. Whenever a network traversal reaches a final state, information in the local registers is used to build a tree intended to imitate the deep structure that a linguistically motivated transformational grammar might assign to that constituent.
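To make the flavor of that machinery concrete, here is a minimal sketch, in present-day Python, of how a tiny ATN fragment might be encoded and interpreted. The networks, state names, register names, and lexicon are illustrative inventions of this sketch, not the notation of Woods (1970) or the Lunar grammar, and for brevity the sketch simply returns the register collection at a final state rather than building a deep-structure tree.

# Hypothetical, much-simplified ATN interpreter. Networks are dictionaries
# mapping states to arcs; an arc tests the next word (CAT), recursively
# invokes another network (PUSH), or accepts (POP), and its action fills
# registers. Register changes are not undone when a branch fails, which a
# real implementation would have to handle.

LEXICON = {"the": "DET", "cat": "N", "rat": "N", "ate": "V"}

# Each arc: (type, label, action, next_state).
NETWORKS = {
    "NP": {
        "NP/": [("CAT", "DET", lambda r, w: r.update(det=w), "NP/det")],
        "NP/det": [("CAT", "N", lambda r, w: r.update(head=w), "NP/done")],
        "NP/done": [("POP", None, None, None)],
    },
    "S": {
        "S/": [("PUSH", "NP", lambda r, c: r.update(subj=c), "S/subj")],
        "S/subj": [("CAT", "V", lambda r, w: r.update(verb=w), "S/verb")],
        "S/verb": [("PUSH", "NP", lambda r, c: r.update(obj=c), "S/done")],
        "S/done": [("POP", None, None, None)],
    },
}

def traverse(net, state, words, pos):
    """Depth-first traversal; returns (registers, new position) or None."""
    return _arcs(net, state, words, pos, {})

def _arcs(net, state, words, pos, regs):
    for kind, label, action, nxt in NETWORKS[net][state]:
        if kind == "POP":
            return regs, pos
        if kind == "CAT" and pos < len(words) and LEXICON.get(words[pos]) == label:
            action(regs, words[pos])
            result = _arcs(net, nxt, words, pos + 1, regs)
            if result:
                return result
        if kind == "PUSH":
            sub = traverse(label, label + "/", words, pos)
            if sub:
                constituent, newpos = sub
                action(regs, constituent)
                result = _arcs(net, nxt, words, newpos, regs)
                if result:
                    return result
    return None  # no arc leads onward from this state

print(traverse("S", "S/", "the cat ate the rat".split(), 0))
# ({'subj': {'det': 'the', 'head': 'cat'}, 'verb': 'ate',
#   'obj': {'det': 'the', 'head': 'rat'}}, 5)

Even in this toy form the two characteristic ingredients are visible: category-testing and phrase-pushing arcs, and register-setting actions that accumulate information during a single left-to-right traversal.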
I took over the task of extending an English grammar written in the ATN notation, and that was my contribution to the first version of the Lunar question-answering system (Woods and Kaplan 1971). Bonnie Webber joined us and extended the semantics and other features for the second version (Woods, Kaplan, and Nash-Webber 1972). Lunar and its ATN parser and grammar are described at greater length in Bill’s LTA acceptance paper (Woods 2006) and elsewhere. The coverage and accuracy of the Lunar grammar demonstrated the power of the ATN formalism to provide relatively simple descriptions of complicated English sentences.

The ATN formalism also had psycholinguistic potential. In the natural order of execution for the transformational formalism, the entire input string is scanned once for every transformation. That seemed psychologically implausible on its face. To the extent that a transformation encodes separate syntactic generalizations, this direct interpretation pits the needs of the language learner—presumably to identify as many independent generalizations as possible—against the needs of the language understander—to focus on the words at hand. In contrast, the natural order of execution is reversed in the ATN setup. Discounting failures resulting from incorrect heuristic choices, the ATN parser makes a single pass through the surface phrase structure of the input string, applying only the grammatical options and constraints that are relevant at each phrasal position. This seemed to comport better with the view that the human language faculty is an evolutionary compromise between the requirements that languages be easy to learn, easy to produce, and easy to comprehend.

I wrote an early paper (Kaplan 1972) describing how the ATN parser and a simple ATN grammar could be organized to model the complexity predictions of some of the proposed perceptual strategies. I showed that those strategies could be simulated by carefully fixing the order of arcs leaving each state and using depth-first backtracking for recovery when a chosen arc does not lead to a successful traversal.

My colleagues Eric Wanner, Mike Maratsos, and I investigated the ATN modeling approach in a different way. The relative-clause sentences in Example (1) are equal in terms of their grammatical complexity and they both express essentially the same predicate-argument relations. But clearly they differ in their psychological complexity.

The ATN uses some special actions and a special memory, the so-called hold-list, to handle the long-distance filler-gap dependencies in relative clauses and questions. The different nouns in Example (1a) get stacked up in that memory, perhaps overloading it, while in Example (1b) the early noun is removed before the later one is added. We proposed that the hold-list is a particularly costly psychological resource and set out to test what we called the “hold hypothesis.”

One set of experiments was enough to get me a Ph.D. The sentences in Example (2) are the same except for the relative-clause verbs. By virtue of the different lexical properties of told and failed, the hold-list is empty at the word stop in Example (2a) but still contains a representation of the driver in Example (2b).
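A small illustrative trace may make the contrast concrete. The annotated word lists below are hypothetical stand-ins for the Example (2) sentences, not the actual experimental items; each word is hand-annotated with the hold-list operation that an ATN-style parser would plausibly perform at that point, and the running size of the hold-list serves as a crude proxy for transient memory load.

# Hypothetical hold-list trace. HOLD pushes the relativized noun phrase
# onto the hold-list; RETRIEVE pops it once its gap has been found. The
# sentences are invented illustrations of the told/failed contrast
# described in the text, not the original experimental materials.

def hold_profile(annotated_words):
    """Hold-list size as each word is encountered; retrieval takes effect
    after the word that licenses the gap has been processed."""
    held, profile = [], []
    for word, op in annotated_words:
        profile.append((word, len(held)))
        if op == "HOLD":
            held.append(word)
        elif op == "RETRIEVE":
            held.pop()
    return profile

# (2a)-style: "told" takes the held noun phrase as its object, so the
# filler comes off the hold-list before "stop" is reached.
sentence_a = [("the", None), ("driver", "HOLD"), ("that", None),
              ("the", None), ("officer", None), ("told", "RETRIEVE"),
              ("to", None), ("stop", None), ("exceeded", None)]

# (2b)-style: "failed" takes no object, so the filler is still on the
# hold-list while "stop" is being processed.
sentence_b = [("the", None), ("driver", "HOLD"), ("that", None),
              ("the", None), ("officer", None), ("failed", None),
              ("to", None), ("stop", "RETRIEVE"), ("exceeded", None)]

for label, sentence in [("2a", sentence_a), ("2b", sentence_b)]:
    print(label, hold_profile(sentence))
# At "stop" the hold-list is already empty in the (2a)-style trace but
# still holds "driver" in the (2b)-style trace; by "exceeded" the
# difference has disappeared.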
If the hold-list memory is expensive, processing load should be higher at stop in Example (2b) than in Example (2a), and the difference should die out at exceeded. As shown in Figure 2, this prediction was borne out using a reaction-time monitoring task as a measure of transient processing load and a variant of these sentences with more words in the region of contrast (Kaplan 1974). Wanner and Maratsos (1978) found further support for this hypothesis with other measures and a different syntactic contrast.

In my thesis I also articulated a particular architecture for performance models with components that were relatively independent and could be studied separately, along the lines of Simon (1962). Such a model would include

• a grammar in some formalism, as a repository of linguistic knowledge,
• a processor that allocates computational resources to interpret that formalism,
• an agenda of order and preference specifications to govern a presumably nondeterministic search, and
• a vector of coefficients that map processor resources into predictions of cognitive load.

The individual components should be evaluated as to how well they fit into an overall model, but they should also be evaluated as explanatory artifacts in their own domains. The grammar, as a repository of linguistic knowledge, should be a revealing encoding of linguistic generalizations, while the processor should admit of a well-engineered implementation.

A combination of linguistic and implementation considerations led me to propose some basic changes to Bill’s ATN framework. As I mentioned, the internal state of an ATN network traversal was determined by mini-procedures on the arcs and recorded in a collection of local registers. These were converted to deep-structure trees at the end of a valid path, in imitation of conventional linguistic theories. I noticed that information was often lost in this conversion, because it was overlooked in the specification or because there was no natural place in a tree for some of the features to reside. Moreover, the parser contained extra routines for constructing trees from registers at the end of each traversal, and additional tree-walking code to inspect those trees at higher levels of recursion.

Information loss could be avoided and many subroutines could be eliminated by changing the underlying representation produced at each level of recursion and for the entire input string. Instead of converting to deep tree structures, I proposed (Kaplan 1974) simply taking the collection of registers, encoded as a hierarchical attribute-value list, as the output of the parsing process. Figure 3 illustrates the migration from a deep-structure tree to an attribute-value representation for a passive sentence with an embedded relative clause (Example (3)).

(3) The rat that bit the dog was eaten by the cat.

Attribute-value matrices of this type are the original, primitive representations of what became the functional structures of Lexical Functional Grammar and the feature structures of other unification formalisms.

A second observation motivated a substantial reduction in the inventory of actions and conditions that could be used to set and test the values of registers during a network traversal. A parade example of ATN elegance is the customary way of relating the passive sentence in Example (3) to its corresponding active in Example (4).
(4) The cat ate the rat that bit the dog.

Roughly, for both the active and passive, the top-level sentence network first guesses that the initial NP is the subject and the immediately following verb is the action. This guess is corroborated when the verb is followed by an object NP, as in Example (4). If instead a participle like eaten is encountered after was, as in Example (3), the network branches to an alternative path set up to handle passives. On that path the register contents are rearranged so that the participle becomes the action, what was initially guessed as the subject becomes the object, and the NP after by becomes the final subject, correcting the initial guess. In the end the active and passive sentences are assigned essentially the same attribute-value structures, with the subject and object values denoting the logical arguments of the action eat, as shown in Figure 3. An attractive aspect of this analysis is that it involves no backtracking over strings or recomputation of phrases, just the ability to modify register values. The perceptual-strategy complexity difference could be attributed to this extra fix-up computation.

This simple solution breaks down in the face of tag questions:

(5) a. Mary kissed John, didn’t she?
    b. John was kissed by Mary, wasn’t he?

The pronoun in the tag agrees with the original surface subject, not the logical subject, whether the main clause is active or passive, and the auxiliary verb in the tag is determined by the initial verb of the main clause. The initial settings of those registers must be preserved if the tag is to be analyzed correctly. The motivation for the original ATN register operations is further weakened by the grammaticality contrast in Example (6).

(6) a. The sheep that eats grass runs.
    b. *The sheep that eats grass run.

The verbs of the main and relative clauses must match in number when the first noun phrase is understood as the subject of both clauses. If the number of the first noun is explicitly marked, this would follow indirectly from the local agreement of each verb with its own subject in the normal left-to-right order of register calculations (the SVAGR condition in Figure 1). But the number of sheep is naturally unspecified, and there is no common value for the verbs to agree with separately through standard register operations.

I concluded from these and other examples that the set of register actions and conditions could and should be replaced by one composite operation defined to merge two attribute-value structures provided they do not contain conflicting values, and that that operation must be transitive and order-free. I called this the same predicate: It asserts that two structures have exactly the same attributes and values. This was the seed of the equality predicate of LFG and the unification operator in Martin’s Functional Unification Grammar (Kay 1979, 1984) and other formalisms in that tradition.
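A minimal sketch of such a composite operation, over attribute-value structures represented as nested Python dictionaries, is given below. It illustrates the general merge-unless-conflicting idea rather than reconstructing the actual same predicate or LFG’s equality, and the attribute names and values are invented for the sheep example.

# Illustrative merge of attribute-value structures: atomic values must be
# identical, nested structures are merged recursively, and any conflict
# makes the whole combination fail.

class Clash(Exception):
    """Raised when two structures assert conflicting values."""

def same(a, b):
    """Merge two attribute-value structures, failing on any conflict.

    The operation is order-free: same(a, b) and same(b, a) describe the
    same combined structure."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for attr, value in b.items():
            merged[attr] = same(merged[attr], value) if attr in merged else value
        return merged
    if a == b:
        return a
    raise Clash(f"{a!r} conflicts with {b!r}")

# The subject of Example (6) carries no number of its own ("sheep"), so a
# verb can impose its own agreement demand on that structure...
subject = {"pred": "sheep", "def": True}
print(same(subject, {"num": "sg"}))        # relative-clause verb: eats
# {'pred': 'sheep', 'def': True, 'num': 'sg'}

# ...but once singular number has been asserted, a plural main verb, as in
# Example (6b), produces a clash rather than a parse.
try:
    same(same(subject, {"num": "sg"}), {"num": "pl"})
except Clash as clash:
    print("clash:", clash)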
My early ATN account of perceptual strategies relied on a carefully tuned, fixed ordering of arcs leaving each state in the grammar. Mehler and Carey (1967) conducted a quite trivial experiment that suggested that this was not a good arrangement. Still thinking about the psychological reality of transformational grammar, they wanted to see whether listeners were sensitive to differences in surface-structure grouping, as in Example (7a,b), and differences in deep-structure relations (Example (7c,d)).

(7) a. They [are forecasting] cyclones.
    b. They are [conflicting desires].
    c. They are delightful to embrace. (= Someone embraces them.)
    d. They are hesitant to travel. (= They travel.)

Without getting into the details of their primitive experiment, Mehler and Carey found that a sentence of one type (say Example (7a)) is easier to process after presenting ten sentences of the same type than after priming with ten sentences of the contrasting type (Example (7b)). It seems that comprehension strategies can be influenced by very immediate, short-term syntactic experience, yet we would not want to say that the listener’s knowledge of English is so unstable and changes so quickly. Rather, preference order should not be embedded in the grammar but encoded as an independent and perhaps rapidly fluctuating overlay. This is the separate agenda component of the modeling architecture I have sketched. As an aside, I think such short-term variability poses a conceptual challenge for psycholinguistic models that might be based on modern probabilistic and deep learning approaches, if preferences and structural constraints are tightly coupled or even inseparable.

My original account also depended on the default top-down, depth-first search policy of the ATN parser. Backtracking from a failed hypothesis could be very expensive, and fluctuating preferences might increase the likelihood that subsequent choices would be incorrect. It was well known that cubic-time performance could be achieved for any reasonable context-free parser simply by incorporating a memory for previously analyzed constituents and partial constituents. That is a very good engineering trade-off: quadratic space for exponential time. It seemed reasonable to think that the psychological system would also make such a trade-off, to mitigate the damaging effect of incorrect heuristic choices. I stripped down the ATN parser to its bare essentials and then combined it with Martin’s chart data structure to represent the well-formed partial constituents. Martin and I later developed a simplified and more abstract implementation of this idea, which we called an “active-chart parser.” He describes this evolution also in his LTA acceptance paper (Kay 2006).
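That engineering trade-off can be suggested with a minimal sketch that uses ordinary memoization over a toy context-free grammar in place of the real chart machinery; the grammar, the function name, and the representation are illustrative assumptions here, not the stripped-down ATN or the active-chart parser itself. The memo table, keyed by category and starting position, is what keeps a backtracking search from re-analyzing constituents it has already found.

from functools import lru_cache

# Toy context-free grammar: category -> list of right-hand sides, each a
# tuple mixing categories and literal words. Illustrative only.
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("the", "N"), ("the", "N", "RC")],
    "RC": [("that", "VP")],
    "VP": [("V", "NP"), ("V",)],
    "N":  [("cat",), ("rat",), ("dog",)],
    "V":  [("ate",), ("bit",), ("slept",)],
}

WORDS = tuple("the rat that bit the dog slept".split())

@lru_cache(maxsize=None)   # the memory: one entry per (category, start position)
def complete(cat, start):
    """All positions where a constituent of category `cat` starting at
    `start` could end. Memoization guarantees each (cat, start) pair is
    analyzed at most once, however often backtracking asks about it."""
    ends = set()
    for rhs in GRAMMAR[cat]:
        positions = {start}
        for symbol in rhs:
            nxt = set()
            for pos in positions:
                if symbol in GRAMMAR:                      # nonterminal
                    nxt |= complete(symbol, pos)
                elif pos < len(WORDS) and WORDS[pos] == symbol:
                    nxt.add(pos + 1)                       # terminal word
            positions = nxt
        ends |= positions
    return frozenset(ends)

print(len(WORDS) in complete("S", 0))   # True: the whole string parses as an S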
That was the end of my stint at Harvard. I have mentioned a few core ideas that emerged from these psycholinguistic and computational considerations: the compositional architecture for performance models, with separate knowledge, process, and strategy components; hierarchical attribute-value structures as an underlying representation for syntax; a single same operation for imposing constraints on attributes and values; and the integration of memory for previously recognized constituents. These ideas are the background for much of my later research.

I met Danny Bobrow while working on the Lunar project. He was the vice president for Artificial Intelligence at BBN, and nominally Bill’s boss. He left BBN to create a language understanding group at Xerox’s new research center in Palo Alto. He invited me to join the group when it looked like I was about to finish my degree. Being a little wary of industry, I asked him how long we would have before Xerox expected us to impact their products. He replied that we would be one of their long-term investments, and that we had a 15-year horizon. I thought I could manage that. I suggested that he also invite Martin Kay, so that we could continue our productive collaboration. We arrived at the same time, in the fall of 1974.

I expected to carry forward with psycholinguistic modeling at PARC, but I found that it was much easier to have an impact in linguistic theory and computational linguistics. Experimental psychology is hard.

As the language understanding project started up, we set to work on different components of an overall system. Danny and Terry Winograd worked on knowledge and reasoning, while Martin and I focused on syntax and morphology. This was all in service of developing a Conversational User Interface, in contrast to the graphical interface that others at PARC were exploring. We expected to assemble the separate components into a prototype system after a few years of work. We eventually did lash the components together in, again, a Simon-esque way. A student, Henry Thompson, volunteered to deal with the non-negligible interactions of components so that we could protect the simplicity of our individual modules. The result was GUS, the early mixed-initiative frame-based dialog system described by Bobrow et al. (1977).

Martin and I had the luxury at PARC of being able to examine fundamental computational and linguistic issues over several years without otherwise showing much incremental progress. We were lucky that our work was not being peer-reviewed and that we did not have to apply for grants. Although ATNs had become a standard platform for many natural language applications, Martin and I argued at great length to identify and converge on the most primitive and abstract operations for parsing. And we had long, stimulating, and enjoyable debates about the nature of grammatical formalisms and syntactic representations. We agreed on attribute-value matrices as the basic encoding of underlying structure and something akin to the same predicate as the primitive descriptive device. We found further computational motivation for same: Its order-free property made it easy to configure an active-chart parser to implement bidirectional, island-driving parse strategies of the sort that were being proposed for some early approaches to speech recognition. But we had different and strongly held views about how to proceed from there.

Roughly, Martin was attracted to the idea that the grammar itself could be coded as elaborated feature structures, and that parsing consisted of applying implicitly a same-like operator—unification—to combine the gr